Quantitative Data Cleaning for Large Databases

Author

  • Joseph M. Hellerstein
Abstract

Data collection has become a ubiquitous function of large organizations – not only for record keeping, but to support a variety of data analysis tasks that are critical to the organizational mission. Data analysis typically drives decision-making processes and efficiency optimizations, and in an increasing number of settings is the raison d’être of entire agencies or firms. Despite the importance of data collection and analysis, data quality remains a pervasive and thorny problem in almost every large organization. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. As a result, there has been a variety of research over the last decades on various aspects of data cleaning: computational procedures to automatically or semi-automatically identify – and, when possible, correct – errors in large data sets. In this report, we survey data cleaning methods that focus on errors in quantitative attributes of large databases, though we also provide references to data cleaning methods for other types of attributes. The discussion is targeted at computer practitioners who manage large databases of quantitative information, and designers developing data entry and auditing tools for end users. Because of our focus on quantitative data, we take a statistical view of data quality, with an emphasis on intuitive outlier detection and exploratory data analysis methods based in robust statistics [Rousseeuw and Leroy, 1987; Hampel et al., 1986; Huber, 1981]. In addition, we stress algorithms that can be implemented easily and efficiently in very large databases, and which are easy to understand and visualize graphically. The discussion mixes statistical intuitions and methods, algorithmic building blocks, efficient relational database implementation strategies, and user interface considerations.
Throughout the discussion, references are provided for deeper reading on all of these issues.
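
To make the idea of outlier detection based on robust statistics concrete, here is a minimal sketch of one widely used robust technique: flagging values that lie far from the median, measured in units of the median absolute deviation (MAD). The abstract does not specify which methods the report uses, so the function name `mad_outliers`, the default threshold of 3.0, and the sample data are all illustrative assumptions, not the report's own procedure.

```python
from statistics import median


def mad_outliers(values, threshold=3.0):
    """Return the values whose robust z-score exceeds `threshold`.

    Hypothetical helper for illustration only; the threshold is an assumption.
    """
    center = median(values)
    # Median absolute deviation: a robust analogue of the standard deviation.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the values equal the median; no usable spread estimate.
        return []
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the clean data are approximately normal.
    scale = 1.4826 * mad
    return [v for v in values if abs(v - center) / scale > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 1000]  # made-up sample data
print(mad_outliers(readings))
```

Unlike the mean and standard deviation, the median and MAD are barely moved by a single extreme value, so the extreme value cannot mask itself by inflating the spread estimate.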

Similar resources

Discovering Informative Patterns and Data Cleaning

We present a method for discovering informative patterns from data. With this method, large databases can be reduced to only a few representative data entries. Our framework encompasses also methods for cleaning databases containing corrupted data. Both on-line and off-line algorithms are proposed and experimentally checked on databases of handwritten images. The generality of the framework mak...

A Machine Learning Approach to Data Cleaning in Databases and Data Warehouses

The problems of data quality and data cleaning are inevitable in data integration from distributed operational databases and online transaction processing (OLTP) systems (Rahm & Do, 2000). This is due to the lack of a unified set of standards spanning over all the distributed sources. One of the most challenging and resource-intensive phases of data cleaning is the removal of fuzzy duplicate re...

Cleaning Up Very Large Databases and Keeping Them Clean

This presentation shows a real-world example of how a very large Customer database was cleansed and de-duplicated to shrink it down to a manageable size. The techniques used to do this are shown, as well as the processes that were implemented to maintain the new level of data cleanliness. The tricks and techniques are applicable to customer files or databases of any size in any business. Actual...

Data Mining & Knowledge Discovery in Databases: An AI Perspective

Data mining and knowledge discovery have several important application areas. Data mining and knowledge discovery have been topics considered at many AI, database and statistical conferences. Knowledge discovery generally refers to the process of identifying valid, novel and understandable patterns. Knowledge discovery from large databases, often called data mining, refers to the application of ...

Rank-based strategies for cleaning inconsistent spatial databases

A spatial dataset is consistent if it satisfies a set of integrity constraints. Although consistency is a desirable property of databases, enforcing the satisfaction of integrity constraints might not always be feasible. In such cases the presence of inconsistent data may have a negative effect on the results of data analysis and processing and, in consequence, there is an important need for da...

Publication date: 2008